{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/openai/chatgpt/plugins/ask-lex/ask-lex-indexer.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/openai/chatgpt/plugins/ask-lex/ask-lex-indexer.ipynb)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "ZNhITaNDNfGC"
      },
      "source": [
        "# Indexing for the Ask Lex ChatGPT Plugin\n",
        "\n",
        "This notebook works through the indexing process for processing videos from a YouTube channel (in this case Lex Fridman) and storing them inside a Pinecone vector DB to be used by a retrieval ChatGPT plugin.\n",
        "\n",
        "To begin we install prerequisite libraries and setup our API keys."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "myDhU_bLGelE"
      },
      "outputs": [],
      "source": [
        "!pip install -qU openai pod-gpt datasets git+https://github.com/openai/whisper.git"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "mEL_yTQYwyp8"
      },
      "source": [
        "Now enter API keys. Note that this will cost some money for creating the embeddings via OpenAI unless within their free usage credits."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2MSBFtHOwlZS"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "\n",
        "# get openai api key at platform.openai.com\n",
        "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\") or \"OPENAI_API_KEY\"\n",
        "# get pinecone api key at app.pinecone.com\n",
        "PINECONE_API_KEY = os.getenv(\"PINECONE_API_KEY\") or \"PINECONE_API_KEY\"\n",
        "# find your environment next to the api key in pinecone console\n",
        "PINECONE_ENV = os.getenv(\"PINECONE_ENVIRONMENT\") or \"PINECONE_ENVIRONMENT\""
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "8lR2qK0jeI2c"
      },
      "source": [
        "## Downloading and Transcribing (Optional)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "JNAjDbJJxr1h"
      },
      "source": [
        "This section is completely optional as you can just jump ahead to the next section and use the prebuilt `jamescalam/lex-transcripts` dataset. If choosing to run this section note that it will likely take multiple days."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0d_Gz_TpCG-m"
      },
      "outputs": [],
      "source": [
        "import pod_gpt\n",
        "\n",
        "channel = pod_gpt.Channel(\n",
        "    channel_id='UCSHZKyawb77ixDdsGog4iWA',  # lex fridman YT channel\n",
        "    api_key=YOUTUBE_API_KEY  # Google YouTube API\n",
        ")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "RCB-nh_ux9CK"
      },
      "source": [
        "You can return a specific number of videos by setting `max_results` parameter. To return **all** videos just remove the parameter to use `channel.get_videos_info()`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3giVv7otCYui"
      },
      "outputs": [],
      "source": [
        "channel.get_videos_info(max_results=2)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "48cxe3GGyLRJ"
      },
      "source": [
        "Load the Whisper audio transcription model:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SlQTZpNKlqXE"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "import whisper\n",
        "\n",
        "# prep whisper model\n",
        "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
        "print(device)\n",
        "\n",
        "model = whisper.load_model(\"large\").to(device)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "0qtiD94SyP2z"
      },
      "source": [
        "Transcribe audio:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SWCNCE-FlqUk"
      },
      "outputs": [],
      "source": [
        "channel.transcribe_videos(model)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "wXJou8SsySX0"
      },
      "source": [
        "Save transcribed audio to local JSONL file:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lLpNhFW8lqRv"
      },
      "outputs": [],
      "source": [
        "channel.save()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "n1cs2FRGv1E2"
      },
      "source": [
        "## Using Prebuilt Dataset\n",
        "\n",
        "Rather than going through the process above, we can skip ahead a little with a prebuilt Lex dataset from Hugging Face. We load it like so:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "T4LE-o7Jv_b_",
        "outputId": "60265c69-7372-45e0-f060-10cf65eea020"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "WARNING:datasets.builder:Found cached dataset json (/root/.cache/huggingface/datasets/jamescalam___json/jamescalam--lex-transcripts-6a9688b7915283fe/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "Dataset({\n",
              "    features: ['video_id', 'channel_id', 'title', 'published', 'transcript', 'source'],\n",
              "    num_rows: 499\n",
              "})"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "data = load_dataset(\n",
        "    'jamescalam/lex-transcripts',\n",
        "    split='train'\n",
        ")\n",
        "data"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "rAWAUsH0yY-f"
      },
      "source": [
        "## Indexing"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "zIh5LDopycVz"
      },
      "source": [
        "Now that we have the dataset ready we can begin indexing it in Pinecone using OpenAI's `text-embedding-ada-002` model. To begin we initialize a `pod_gpt` `indexer`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "84bj1AW5lqO3"
      },
      "outputs": [],
      "source": [
        "import pod_gpt\n",
        "\n",
        "indexer = pod_gpt.Indexer(\n",
        "    openai_api_key=OPENAI_API_KEY,\n",
        "    pinecone_api_key=PINECONE_API_KEY,\n",
        "    pinecone_environment=PINECONE_ENV,\n",
        "    index_name=\"ask-lex\"\n",
        ")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "BFbd6pTVymA0"
      },
      "source": [
        "Now we index our data:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "57397d75dbb44869820bffd80ea17414",
            "63d08638581b42f5a3feaa9aeccc6bb9",
            "2c5d340d164d4eb59f6960582bd003e3",
            "b15c072fbff74b8e8fc00ec8a8a3d5db",
            "3efc38c029ef440c800639bb66be6716",
            "54da1a1fcb37410da0db390624a01a15",
            "203e675327534c1db4a3008e6dc341da",
            "b6ca1737ac344b8a8c27b3ac30ec95ed",
            "0f3da69d99d94f6abc656832bfc255cc",
            "7a528b9fa499414aae9667cf9b433b49",
            "1a73d6cf31a248dea4cba588e0a74334"
          ]
        },
        "id": "_sMiT9cEHDEs",
        "outputId": "7e401b7c-e74c-40db-f4af-62a958517961"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "57397d75dbb44869820bffd80ea17414",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/499 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from tqdm.auto import tqdm\n",
        "\n",
        "for row in tqdm(data):\n",
        "    row['published'] = row['published'].strftime('%Y%m%d')\n",
        "    indexer(pod_gpt.VideoRecord(**row))"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "nheGZ14oyshg"
      },
      "source": [
        "Once complete we can move on to building the remainder of the plugin. Please see [this video](https://youtu.be/bAQ6VRewf0w) for a more detailed walkthrough."
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "0f3da69d99d94f6abc656832bfc255cc": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "1a73d6cf31a248dea4cba588e0a74334": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "203e675327534c1db4a3008e6dc341da": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "2c5d340d164d4eb59f6960582bd003e3": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_b6ca1737ac344b8a8c27b3ac30ec95ed",
            "max": 499,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_0f3da69d99d94f6abc656832bfc255cc",
            "value": 499
          }
        },
        "3efc38c029ef440c800639bb66be6716": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "54da1a1fcb37410da0db390624a01a15": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "57397d75dbb44869820bffd80ea17414": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_63d08638581b42f5a3feaa9aeccc6bb9",
              "IPY_MODEL_2c5d340d164d4eb59f6960582bd003e3",
              "IPY_MODEL_b15c072fbff74b8e8fc00ec8a8a3d5db"
            ],
            "layout": "IPY_MODEL_3efc38c029ef440c800639bb66be6716"
          }
        },
        "63d08638581b42f5a3feaa9aeccc6bb9": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_54da1a1fcb37410da0db390624a01a15",
            "placeholder": "​",
            "style": "IPY_MODEL_203e675327534c1db4a3008e6dc341da",
            "value": "100%"
          }
        },
        "7a528b9fa499414aae9667cf9b433b49": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "b15c072fbff74b8e8fc00ec8a8a3d5db": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_7a528b9fa499414aae9667cf9b433b49",
            "placeholder": "​",
            "style": "IPY_MODEL_1a73d6cf31a248dea4cba588e0a74334",
            "value": " 499/499 [06:56&lt;00:00,  1.74it/s]"
          }
        },
        "b6ca1737ac344b8a8c27b3ac30ec95ed": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        }
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
